Document Analysis Method to Detect BW/Color Areas and
Corresponding Scanning Device



Field of the Invention 

The present invention relates to a document analysis
method and, more particularly, to a document analysis
method to detect BW/color areas.

Moreover, the invention relates to a scanning device to
acquire documents.

Finally, the invention relates to a method for
acquiring a document based on the analysis of the
content of the document itself.



Background of the Invention 

As is well known in the technical field of image
processing, during its life an image is processed by a
plurality of electronic devices that create, acquire,
display, store, read and write the image itself.

The image data processing device and the corresponding
processing method deal with an image acquired by means
of an image acquisition device, for example a scanner.

The image data so obtained are usually organized into a
raster of pixels, each pixel providing an elementary
item of image information.

In other words, images are, at the most basic level,
arrays of digital values, where a value is a collection
of numbers describing the attributes of a pixel in the
image. For example, in bitmaps, the above mentioned
values are single binary digits.

Often, these numbers are fixed-point representations of
a range of real numbers; for example, the integers 0
through 255 are often used to represent the numbers
from 0.0 to 1.0. Often too, these numbers represent the
intensity at a point of the image (gray scale) or the
intensity of one color component at that point.
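The mapping between the integers 0 through 255 and the real range 0.0 to 1.0 can be sketched as follows. This is only an illustration; the function names `to_unit` and `to_level` and the rounding and clamping choices are assumptions, not part of the described method.

```python
def to_unit(level, max_level=255):
    """Map an integer level (0..max_level) to a real value in 0.0..1.0."""
    return level / max_level


def to_level(value, max_level=255):
    """Map a real value to the nearest integer level, clamping to 0.0..1.0 first."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * max_level)


if __name__ == "__main__":
    for level in (0, 64, 200, 255):
        print(level, to_unit(level), to_level(to_unit(level)))
```

With 8 bits per component, each level stands for a step of 1/255 of the full intensity range.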

An important distinction has to be made in the images 
to be processed between achromatic and colored images. 

In fact, achromatic light has only one attribute, which
is the quantity of light. This attribute can be
discussed in the physical sense of energy, in which case
the terms intensity and luminance are used, or in the
psychological sense of perceived intensity, in which
case the term brightness is used.

It is useful to associate a scale with different
intensity levels, for instance defining 0 as black and
1 as white; intensity levels between 0 and 1 represent
different levels of gray.

The visual sensations caused by colored light are much
richer than those caused by achromatic light.
Discussion of color perception usually involves three
quantities, known as hue, saturation and lightness.

1. Hue distinguishes among colors such as red, green, 
purple and yellow. 

2. Saturation refers to how far a color is from a
gray of equal intensity. Red is highly saturated; pink
is relatively unsaturated; royal blue is highly
saturated; sky blue is relatively unsaturated. Pastel
colors are relatively unsaturated; unsaturated colors
include more white light than do the vivid, saturated
colors.

3. Lightness embodies the achromatic notion of
perceived intensity of a reflecting object.
A fourth term, brightness, is used instead of lightness
to refer to the perceived intensity of a self-luminous
object (i.e. an object emitting rather than reflecting
light), such as a light bulb, the sun or a CRT.

The above mentioned features of colors seem to be
subjective: they depend on human observers' judgment.
In reality, the branch of physics known as colorimetry
provides for an objective and quantitative way of
specifying colors, which can be correlated to the above
perceptual classification.

A color can be represented by means of its dominant
wavelength, which corresponds to the perceptual notion
of hue; excitation purity corresponds to the saturation
of the color; luminance is the amount or intensity of
light. The excitation purity of a colored light is the
proportion of pure light of the dominant wavelength and
of white light needed to define the color.

A completely pure color is 100 % saturated and thus
contains no white light, whereas mixtures of a pure
color and white light have saturations somewhere
between 0 and 100 %. White light, and hence gray, is 0 %
saturated, containing no color of any dominant
wavelength.

Furthermore, light is fundamentally electromagnetic
energy in the 400-700 nm wavelength part of the
spectrum, which is perceived as the colors from violet
through indigo, blue, green, yellow and orange to red.
The amount of energy present at each wavelength is
represented by a spectral energy distribution P(λ), as
shown in figure 1.

The visual effect of any spectral distribution can be 
described by means of three values, i.e. the dominant 
wavelength, the excitation purity, and the luminance. 
Figure 2 shows the spectral distribution of figure 1,
illustrating these three values. In particular, it should
be noted that at the dominant wavelength there is a
spike of energy of level e2. White light, the uniform
distribution of energy level e1, is also present.

The excitation purity depends on the relation between
e1 and e2: when e1=e2, excitation purity is 0 %; when
e1=0, excitation purity is 100 %.

Luminance, which is proportional to the integral of the
area under such curve, depends on both e1 and e2.

A color model is a specification of a 3D color
coordinate system and a visible subset in the
coordinate system within which all colors in a
particular range lie. For instance, the RGB (red,
green, blue) color model is the unit cube subset of a
3D Cartesian coordinate system, as shown in figure 3.

More specifically, three hardware-oriented color models
are RGB, used with color CRT monitors; YIQ, i.e. the
broadcast TV color system, which is a re-coding of RGB
for transmission efficiency and for downward
compatibility with black and white television; and CMY
(cyan, magenta, yellow), used for some color-printing
devices. Unfortunately, none of these models is
particularly easy to use, because they do not relate
directly to the intuitive color notions of hue,
saturation, and brightness. Therefore, another class of
models has been developed with ease of use as a goal,
such as the HSV (hue, saturation, value) model,
sometimes called HSB (hue, saturation, brightness), and
the HLS (hue, lightness, saturation) and HVC (hue,
value, chroma) models.

With each model is also given a means of converting to
some other specification.

As stated above, the RGB color model used in color CRT
monitors and color raster graphics employs a Cartesian
coordinate system. The RGB primaries are additive
primaries; that is, the individual contributions of each
primary are added together to yield the result. The
main diagonal of the cube, with equal amounts of each
primary, represents the gray levels: black is (0,0,0);
white is (1,1,1).

Following this gray line implies changing the
three Cartesian values R, G and B at the same time, as
shown with a point-dotted line in figure 4A; this
increases the computational load of the image
processing steps that require the identification of gray
regions.

The RGB model is hardware-oriented. By contrast, the HSV
(as well as the HSB or HLC) model is user-oriented, being
based on the intuitive appeal of the artist's tint,
shade, and tone. The coordinate system is cylindrical, as
shown in figure 4B.

The HSV model (like the HLC model) is easy to use. The
grays all have S=0 and they can be removed from an
image data raster by means of a cylindrical filter in
proximity of the V axis, as shown in figure 5;
moreover, the maximally saturated hues are at S=1,
L=0.5.

The HLS color model is a reduced model obtained from
the HSV cylindrical model, as shown in figure 6; the
reduction of the color space is due to the fact that
some colors cannot be saturated. Such space subset is
defined as a hexcone or six-sided pyramid, as shown in
figure 7. The top of the hexcone corresponds to V=1,
which contains the relatively bright colors. The colors
of the V=1 plane are not all of the same perceived
brightness, however.



Hue, or H, is measured by the angle around the vertical
axis, with red at 0°, green at 120° and so on (see figure
7). Complementary colors in the HSV hexcone are 180°
opposite one another. The value of S is a ratio ranging
from 0 on the center line (V axis) to 1 on the
triangular sides of the hexcone.

The hexcone is one unit high in V, with the apex at the
origin. The point at the apex is black and has a V
coordinate of 0. At this point, the values of H and S
are irrelevant. The point S=0, V=1 is white.
Intermediate values of V for S=0 (on the center line)
are the grays. The simplicity of using the HSV or an
equivalent color space in order to obtain the gray
regions is therefore immediately apparent.
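The HSV properties described above (red at 0°, green at 120°, complements 180° apart, grays at S=0) can be checked with a short sketch based on Python's standard `colorsys` module. The helper names `hue_degrees` and `is_gray_hsv` and the tolerance parameter `s_max` are illustrative assumptions.

```python
import colorsys


def hue_degrees(r, g, b):
    """Hue of an RGB color (components in 0.0..1.0) as an angle in degrees."""
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
    return round(h * 360) % 360


def is_gray_hsv(r, g, b, s_max=0.0):
    """A pixel lies on the gray center line when its saturation is (near) zero."""
    _h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    return s <= s_max


if __name__ == "__main__":
    print(hue_degrees(1.0, 0.0, 0.0))   # red
    print(hue_degrees(0.0, 1.0, 0.0))   # green
    print(hue_degrees(0.0, 1.0, 1.0))   # cyan, the complement of red
    print(is_gray_hsv(0.5, 0.5, 0.5))   # mid gray
```

In this space a single comparison on S is enough to test for gray, whereas in RGB the three components have to be compared with one another.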

Adding a white pigment corresponds to decreasing S
(without changing V). Shades are created by keeping S=1
and decreasing V. Tones are created by decreasing both
S and V. Of course, changing H corresponds to selecting
the pure pigment with which to start. Thus, H, S, and V
correspond to concepts from the perceptive color
system.

The top of the HSV hexcone corresponds to the
projection seen by looking along the principal diagonal
of the RGB color cube from white toward black, as shown
in figure 8.

Figure 9 shows the HLS color model, which is
defined in a double-hexcone subset of the cylindrical
space. Hue is the angle around the vertical axis of the
double hexcone, with red at 0°. The colors occur around
the perimeter: red, yellow, green, cyan, blue and
magenta. The HLS space can be considered as a
deformation of HSV space, in which white is pulled
upward to form the upper hexcone from the V=1 plane. As
with the single-hexcone model, the complement of any
hue is located 180° farther around the double hexcone,
and saturation is measured radially from the vertical
axis, from 0 on the axis to 1 on the surface. Lightness
is 0 for black (at the lower tip of the double hexcone)
to 1 for white (at the upper tip).
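As a quick check of the HLS geometry, the following sketch uses the standard `colorsys.rgb_to_hls` conversion. The helper name `describe_hls` is an assumption made for illustration.

```python
import colorsys


def describe_hls(r, g, b):
    """Return (hue in degrees, lightness, saturation) for RGB components in 0.0..1.0."""
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    return round(h * 360) % 360, l, s


if __name__ == "__main__":
    print("black:", describe_hls(0.0, 0.0, 0.0))
    print("white:", describe_hls(1.0, 1.0, 1.0))
    print("red:  ", describe_hls(1.0, 0.0, 0.0))
```

Black and white sit at the two tips of the double hexcone (L=0 and L=1), while a pure hue such as red sits on the widest section, at L=0.5 with S=1.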

Many hardware and software packages are currently
available in the technical field of electronic
image processing which provide image data
processing methods and corresponding devices. However,
it should be noted that only a few, if any, operate in
both the personal computer/workstation field and
the embedded devices field.

In fact, embedded devices have a plurality of needs
which turn into tight limitations for the image
processing devices themselves. Particularly, image
processing in an embedded environment seeks:

- to reduce the size of the image data in order to
limit the memory area employed by the image data
processing devices;

- to increase the amount of any text portion
comprised in a document that is OCR'able, i.e. it
should be possible to acquire and understand such
portion by means of an Optical Character Recognition
(OCR) device;

- to get, as the final result of the image data
processing device, a viewable and printable image
which is close to the original acquired image.

Known document analysis methods that try to meet the
above requirements have the problem of being
computationally very heavy and not suited for embedded
applications, where the constraints on processing power
and memory are stringent.

So, even if these solutions may perform an acceptable
analysis of the document, they are not applicable in an
embedded environment.
The main purpose of the known document analysis methods
is the extraction of features and the classification of
text and images in the analyzed documents. Examples of
analysis used in this technical field are known from
the publication "Document Image Analysis" by L.
O'Gorman and R. Kasturi, IEEE Computer Society Press,
which is a collection of the most relevant papers
regarding document analysis.

All the known approaches deal with the recognition of
different types of areas on a page. The areas are
normally classified into regions of text, photo and
line art. The page is then divided into these different
areas (normally in a mutually exclusive way) and each
is treated in a different way. In other terms, the
known document analysis methods deal with understanding
the "type" of information that is on the page.

These solutions tend to sub-divide the page into
mutually exclusive regions that contain different types
of information.

Other known devices deal with decomposed documents,
i.e. documents translated into a plurality of
elementary items of image information called pixels.
Such devices provide a treatment of the decomposed
document as a whole, or at least are able to reconstruct
the information they need only by reprocessing the input
document format.

An illustrative and non-limiting example is a BW fax
machine. If such a device can deal only with BW data and
the document contains a mixture of sparse color and BW
data, the fax machine image processing device must be
able to reconstruct a single BW page from the pieces of
the decomposed original document.



A known way to comply with the embedded environment
requirements leads to peripheral devices that support
only the specified features of a particular product;
that is how cost and performance are satisfied.

However, none of the known solutions deals with the
problem of maintaining the original appearance of the
document, and therefore no emphasis is placed on the
recognition of the color itself in the document and
on what can be done once this color content is known.

One object of the present invention is that of
providing a dual path distinction method for two
different layers, i.e. the BW and color layers,
identifying the features used to classify a certain
group of pixels of a raster image as colorful or not.

The reason for doing this can be explained in the
following way. As an example, in a document such as a
magazine article, there are areas of color, for example
photographs, colored text and highlighted areas,
which include bright colors and which a user would like
to retain as colors. There are also areas, typically
background areas which are either very light or dark,
that, even if one could argue that they have a color
content, can equally well be represented with only
two colors, i.e. black and white.

Moreover, the color information content of a background
area, even if not negligible, could be of no interest
with respect to the BW content. This is the case of the
so-called "business text": the information content of
the image data is superimposed on a color background
content which can be ignored without losing any
useful information.

After the separation between these areas, the data in
each area could be processed differently: color data
could be compressed in a lossy fashion, whereas the BW
data could be binarized, and the user would not see a
big difference in the quality of the document.

Summary of the Invention

The solution idea behind this invention is that of
providing a dual path distinction method which can
create a BW layer and a color layer starting from a
single input data sheet.

According to this solution idea, the invention relates
to a document analysis method using BW/color areas
detection as defined in the enclosed claim 1.

Moreover, the invention relates to a scanning device,
as defined in the enclosed claim 9.

Finally, the invention relates to a method for
acquiring a document based on the analysis of the
content of the document itself, as defined in the
enclosed claim 15.

The features and advantages of the BW/color document
analysis method and layer creator device according to
the invention will be appreciated from the following
description of a preferred embodiment given by way of
non-limiting example with reference to the annexed
drawings.

Brief Description of the Drawings 

Figure 1 shows an example of a spectral energy
distribution of a color;

Figure 2 shows the spectral distribution of figure 1,
illustrating dominant wavelength, excitation purity and
luminance;

Figure 3 shows the 3D Cartesian representation of the
RGB color space, with the fundamental colors;

Figure 4A shows the RGB color space of figure 3 and the
gray line within;

Figure 4B shows the cylindrical representation of the
HSV/HLC color space;

Figure 5 shows a gray filter for the HSV/HLC color
space;

Figure 6 shows the HLS color space;

Figure 7 shows the single-hexcone representation of the
HSV color space;

Figure 8 shows a section of figure 7;

Figure 9 shows the double-hexcone representation of the
HLS color space;

Figure 10 shows schematically a document analysis
method according to the present invention;

Figure 11A shows schematically a dual path layer
creator implementing such method according to the
present invention;

Figure 11B shows more particularly the dual path layer
creator of figure 11A;

Figure 12 shows resulting layers from the dual path
layer creator of figure 11A;

Figures 13A, 13B and 13C show a first atomic operation
used in the method according to the present invention
and its implementation;

Figure 14 shows a particular result for the atomic
operation of figures 13A, 13B and 13C;

Figure 15 shows another example of atomic operation
used in the method according to the present invention;

Figure 16 shows another example of atomic operation
used in the method according to the present invention;

Figure 17 shows more particularly a PDF application of
the method according to the present invention.

Detailed Description of the Invention 

The basic idea underlying the present application is
that of processing a document in order to provide
distinct BW and color layers. Starting from a color page
acquired with a scanner capable of delivering color
data, the first step is to understand where there is
color on the page. For this purpose the method uses
colorfulness and region groupings for document
analysis.

In particular, the document analysis method according
to the present invention comprises the following steps:

1. Getting color image input data, for example in a
pixel raster format.

2. Calculating and extracting from the input data the
colorfulness of each pixel.

3. Creating first and second output layers
corresponding to the BW and color pixels respectively.

4. Applying a first set of atomic operations to the
BW layer and a second, different set of atomic
operations to the color layer.

5. Combining the BW and color layers in order to
obtain a desired format for the output data.

The purpose of such document analysis method is that of
distinguishing between text and image; so, the method
according to the present invention comprises a first
path BW PATH that detects the BW pixels in order to
assemble a first layer TEXT containing the portion of
text comprised in the input data and a second path
COLOR PATH that detects the color pixels in order to
assemble a second layer IMAGE containing the portion of
image comprised in the input data.

In figure 10, the method according to the present
invention is shown in terms of pipelines, i.e. in terms
of "paths" of atomic operations to be performed on the
input data in order to obtain a particular output
format. More particularly, figure 10 shows the two
different output representations, TEXT and IMAGE,
obtained by means of said first and second data paths,
BW PATH and COLOR PATH respectively.

The pipelines or paths define the sequence of atomic
operations to be performed on the input image data.
Such atomic operations, which are individually known in
the field of image data processing, can be grouped
together to generate a plurality of IP (Image
Processing) tools.

In order to obtain the BW/color layer distinction
according to the above method, the following IP tools
may be used:

- a transformation of an image pixel from the RGB
format to another image space format, for example the
HLS (or HLN, for hue, lightness, chroma indicator N)
format;

- a grouping function that associates elementary
information in order to obtain a unique information
group to be processed, such as the blobbing technique;

- a down sampling function;

- a thresholding function;

- an AND/OR and other data extraction function;

- compression functions (in particular, the G4 or
JPEG compression method).

Figure 11A shows a dual path layer creator 1, receiving
input data 2 and outputting a first layer 3 and a second
layer 4 of organized output data.

The input data 2 are in some color format, for example,
but not limited to, the RGB format, and they are
organized in the form of a raster of color pixels. It is
possible to obtain such a data format by means, for
example, of a scanner.

As an example, from a 300dpi 24bpp (bits per pixel)
color input image, the dual path layer creator 1
outputs a 300dpi 8bpp [bottom layer] representation of
the input data, as well as a 150dpi 24bpp color
representation [top layer]. The effect of this is that
instead of having to process the 300dpi, 24bpp data of
24Mbyte, only 300dpi 8bpp = 8Mbyte and 150dpi, 24bpp =
6Mbyte of data have to be processed, the sum of 14
Mbyte being much less than the original 24Mbyte.
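The figures above can be reproduced with a short calculation. The page size is not stated in the text; the sketch below assumes a US Letter page (8.5 x 11 inches) and 1 Mbyte = 1024 x 1024 bytes, which happens to match the quoted 24, 8 and 6 Mbyte values. The function name `layer_bytes` is hypothetical.

```python
MIB = 1024 * 1024
PAGE_INCHES = (8.5, 11.0)  # assumption: US Letter, not stated in the text


def layer_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a raster layer at the given resolution and depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


full_color = layer_bytes(*PAGE_INCHES, 300, 24)   # original input
bottom_bw = layer_bytes(*PAGE_INCHES, 300, 8)     # bottom layer
top_color = layer_bytes(*PAGE_INCHES, 150, 24)    # top layer

if __name__ == "__main__":
    for name, size in (("300dpi 24bpp", full_color),
                       ("300dpi 8bpp", bottom_bw),
                       ("150dpi 24bpp", top_color)):
        print(f"{name}: {size / MIB:.1f} Mbyte")
    print(f"two layers: {(bottom_bw + top_color) / MIB:.1f} Mbyte")
```

Halving the resolution of the color layer quarters its pixel count, which is where most of the saving comes from.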

Advantageously, according to the present invention, the
dual path layer creator 1 produces the two document
layers 3 and 4 simultaneously.

Particularly, such layers 3 and 4 have different
resolutions, as shown in figure 12. As an example, a
resolution of 150dpi is used when the color information
of the image input data is needed, e.g. in case of a
graphic representation of the output data (color
format). Moreover, a resolution of 300dpi is needed for
the OCR'able text portion of the image (BW format).

It should be noted that the sum of the sizes of the
color layer at 150dpi plus the BW layer at 300dpi is
lower than the size of the color layer at 300dpi.

To this end, according to the present invention, a
method for acquiring a document based on the analysis
of the content of the document itself comprises the
following steps:

- getting input image data;

- creating a first layer containing the image
information in a color format;

- creating a second layer containing the image
information in a BW format;

- managing the first and the second layers in order
to obtain a desired format for an output document.

In order to further reduce the memory area
requirements, a dual path processing line 5 for the BW
and color layers obtained by the dual path layer
creator 1 is shown schematically in figure 11B.

The dual path processing line 5 comprises the dual path
layer creator 1 receiving input data 2 and
outputting the first output data layer 3 and the second
output data layer 4.

The first output data layer 3 is then processed by
means of a thresholder 6 and a G4 compressor 7 connected
in series, while the second output data layer 4 is
only compressed by means of a JPEG compressor 8. The
output data layers so processed are then forwarded to an
output reconstruct block 9, which provides the required
output format.

As an example, from the 300dpi 8bpp and the 150dpi
24bpp representations obtained from the dual path layer
creator 1, the thresholder 6 produces a 300dpi 1bpp
representation and the G4 compressor 7 an 80kByte bottom
layer, while the JPEG compressor 8 produces a 250kByte
top layer. The effect is that only
80+250=330kByte of output data have to be processed,
instead of the 24MByte of input data.
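The overall reduction implied by these example figures can be worked out as follows. This is a small arithmetic sketch; the function name `reduction_factor` and the convention 1 MByte = 1024 kByte are assumptions.

```python
KBYTE_PER_MBYTE = 1024  # assumption: binary convention


def reduction_factor(input_mbyte, *layer_kbyte):
    """Ratio between the input size and the total size of the compressed layers."""
    total_kbyte = sum(layer_kbyte)
    return input_mbyte * KBYTE_PER_MBYTE / total_kbyte


total = 80 + 250                      # G4 bottom layer + JPEG top layer, in kByte
factor = reduction_factor(24, 80, 250)

if __name__ == "__main__":
    print(f"{total} kByte in total, about {factor:.0f} times smaller than the input")
```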

As an example, creating color and BW layers, as
previously suggested, decreases the memory area
requirements. Moreover, processing layers having
reduced sizes increases the processing speed, thus
enhancing the performance of the processing line as a
whole.

Figure 12 shows the resulting compressed layers (TEXT
and IMAGE) from the dual path layer creator 1 of figure
11A.

More particularly, in order to distinguish regions on a
page that are colorful from regions that are not,
a BW/color areas detection document analysis method
according to the above indication comprises the
following steps:

1. Analysing the input data in a Chroma space format.

2. Calculating and extracting from the input data the
colorfulness of each pixel.

3. Down sampling the Chroma indication channel.

4. Applying a threshold to the down sampled data.

5. Labelling ON the pixels having a colorfulness above
the threshold and OFF the pixels having a colorfulness
lower than the threshold.

According to the present invention, the BW/color areas
detection document analysis method can be improved by
adding the further step of:

6. Grouping the color information of single small
groups of pixels, improving the compression of the
document and enabling the elimination of small groups
of pixels that are still considered to be due to noise
or that are, in any case, of insignificant size.

As shown in figure 13A, the BW/color areas detection
document analysis method comprises a down sampling and
a thresholding step. Once the colorfulness of each
pixel has been calculated and extracted from the input
data (FORMAT 1), the image data are down sampled.

The down sampling can be performed in various
ways that all have a different effect on the
performance of the algorithm. If down sampling by
taking the average value in a neighborhood is used, a
lot of document noise (small regions of color due to
the printing process, for example) and scanner noise
(jitter on the RGB signal) can be eliminated.
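Down sampling by neighborhood averaging can be sketched as below. The text does not fix the neighborhood shape or size; non-overlapping square blocks and integer division are assumptions made here, and `downsample_mean` is a hypothetical name.

```python
def downsample_mean(channel, factor):
    """Average non-overlapping factor x factor blocks of a 2D list of integers.

    Rows and columns that do not fill a complete block are ignored.
    """
    rows = len(channel) // factor
    cols = len(channel[0]) // factor
    out = []
    for by in range(rows):
        out_row = []
        for bx in range(cols):
            total = 0
            for y in range(by * factor, (by + 1) * factor):
                for x in range(bx * factor, (bx + 1) * factor):
                    total += channel[y][x]
            out_row.append(total // (factor * factor))
        out.append(out_row)
    return out


# Colorfulness channel with one isolated noisy pixel (120) and one solid region (200).
N_CHANNEL = [
    [0, 0, 200, 200],
    [0, 120, 200, 200],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    print(downsample_mean(N_CHANNEL, 2))
```

The isolated pixel is diluted by its neutral neighbors, while the solid colorful block keeps its full value, so a later threshold can tell them apart.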

The down sampled image data are then selected by means
of a threshold, so obtaining an image having a pixel
format without the spike noise (FORMAT 2). In fact, the
output data are considered ON if the colorfulness of the
input image pixel is above the threshold and OFF if it
is lower than the threshold.

The simplest case consists in applying a fixed
threshold. The output of this stage is a down sampled
version of the original image that has ON pixels in
those regions where the color content of the original
image was above the color threshold.

Different threshold values can be considered with
reference to different final devices.

For example, a low resolution display does not need to
receive image data with 16 million colors, since such a
device has no possibility of processing and
displaying this kind of complex image data. Image
data forwarded to a low resolution display can be
obtained by means of particular thresholding values by
limiting the number of available colors, e.g. filtering
pale colors and transforming them into white or
"clustering" different types of "reds" in order to have
only one "red".

An example of an acceptable range for the threshold
values is 0 to 30 for input image data of 0 to 255.

The more complex case accumulates a histogram of the
color content of the page and, using a heuristic,
decides what the best threshold for the page is.
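The heuristic itself is not specified in the text. As one possible stand-in, the sketch below accumulates a histogram of colorfulness values and picks the threshold with Otsu's method, which maximizes the between-class variance; this choice is an assumption and not necessarily the heuristic intended here. The function names are hypothetical.

```python
def histogram(values, levels=256):
    """Count how many pixels have each colorfulness level."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    return hist


def otsu_threshold(hist):
    """Return t such that splitting into <= t and > t maximizes between-class variance."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their levels
    for t, count in enumerate(hist):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Mostly neutral page (low colorfulness) with a smaller strongly colored region.
SAMPLE = [2] * 50 + [5] * 30 + [180] * 10 + [200] * 10

if __name__ == "__main__":
    print(otsu_threshold(histogram(SAMPLE)))
```

A practical implementation might additionally clamp the result to an acceptable range such as the 0 to 30 interval mentioned above.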

The color information of single small groups of pixels
can be further grouped together using known simple
grouping techniques. The grouping step is performed on
data of connected components, as shown in figure 13A
(FORMAT 3). This has the advantage of grouping regions
of pixels that are considered colorful into bigger
groups. In this way, when the regions of colorful pixels
are compressed (in a later stage), not every pixel has
to be compressed individually. The compression of a
larger group of pixels is more efficient than the
separate compression of each single colorful region.

The grouping of pixels also has the advantage of
enabling the elimination of small groups of pixels that
are still considered to be due to noise or that are, in
any case, of insignificant size.
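Grouping on connected components and discarding small groups can be sketched as follows. The text does not state the connectivity rule or the size limit; 4-connectivity and the `min_size` parameter are assumptions, and `remove_small_groups` is a hypothetical name.

```python
from collections import deque


def remove_small_groups(mask, min_size):
    """Keep only 4-connected groups of ON pixels with at least min_size pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    out = [[0] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search collects one connected group of ON pixels.
            group = []
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                group.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(group) >= min_size:
                for y, x in group:
                    out[y][x] = 1
    return out


MASK = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    for row in remove_small_groups(MASK, 2):
        print(row)
```

The isolated ON pixel is treated as noise and dropped, while the 2 x 2 group survives as a single region that can be compressed as one unit.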

The whole procedure that has been described can also be
performed on a strip basis on the whole original image,
without any modification, as shown in figure 13B. Strip
based analysis produces a useful side effect on the
grouping of pixels. In fact, if the grouping of pixels
is performed on a strip basis, the grouping of pixels
enables an approximation of the contour of colored
regions, as shown in figure 14.

Figure 13C shows a down sampling/thresholding/grouping
device 10. The first component is an RGB to Chroma
space converter 11. It converts, on a pixel by pixel
basis, the color representation of every input pixel
into a different color space representation.

When the aim is making decisions on the colorfulness of
a pixel, using the right color space representation is
important.

The RGB color space has been found not very convenient
for this type of analysis. The color space used should
have an indication of the colorfulness of the specific
pixel. The HLN (Hue/Lightness/chroma indicator N) color
space was found particularly convenient and is used in
the current realization.

The indication of Chroma, in this HLN color space, is
directly the content of the N channel, where N = max
(R,G,B) - min (R,G,B).
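The N channel defined above can be computed per pixel as in the following sketch. The formula is the one given in the text; the function names and the representation of the raster as nested lists of (R, G, B) tuples in the range 0 to 255 are assumptions.

```python
def chroma_n(r, g, b):
    """Chroma indicator N = max(R, G, B) - min(R, G, B)."""
    return max(r, g, b) - min(r, g, b)


def n_channel(image):
    """Compute the N channel for a raster given as rows of (R, G, B) tuples."""
    return [[chroma_n(r, g, b) for (r, g, b) in row] for row in image]


if __name__ == "__main__":
    page = [
        [(128, 128, 128), (255, 0, 0)],      # mid gray, pure red
        [(200, 180, 190), (255, 255, 255)],  # pale color, white
    ]
    print(n_channel(page))
```

Grays, black and white all give N = 0, and pale colors give small values, so a single threshold on N separates colorful pixels from neutral ones.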

The down sampling/thresholding/grouping device 10
further comprises a down sampler 12, which down samples
the N channel, and a thresholding device 13, in turn
comprising a threshold selector 14 and a look-up table
LUT 15 which apply a threshold to the down sampled
data.
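A look-up table realization of the thresholding can be sketched as follows: the selected threshold is turned once into a 256-entry table, and each down sampled value is then classified by a single table access. The function names are hypothetical, and treating a value equal to the threshold as OFF is an assumption, since the text only defines the cases above and below.

```python
def build_threshold_lut(threshold, levels=256):
    """Table mapping each colorfulness level to 1 (ON) or 0 (OFF)."""
    return [1 if level > threshold else 0 for level in range(levels)]


def apply_lut(channel, lut):
    """Classify every value of a 2D channel through the look-up table."""
    return [[lut[value] for value in row] for row in channel]


if __name__ == "__main__":
    lut = build_threshold_lut(30)
    print(apply_lut([[0, 31], [255, 30]], lut))
```

Changing the threshold only requires rebuilding the table, not changing the per-pixel loop.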

The output data are considered ON if the colorfulness of
the input image pixel is above the threshold and OFF if
it is lower than the threshold.

Moreover, the color information of single small groups
of pixels can be further grouped together by means of a
grouping block 16, using known simple grouping
techniques on data of connected components, in order to
improve the compression of the image data and eliminate
small groups of pixels that are still considered to be
due to noise or that are, in any case, of insignificant
size.

It should be noted that an HLS to RGB converter (not
shown in figure 13C) can also be added to the down
sampling/thresholding/grouping device 10 in order to
obtain RGB output data.

A simple grouping procedure called blobbing can be used
in order to extract the images from a document, as
shown in figure 15, where the blobbed regions should
correspond to the images of the document.

Moreover, after the recognition of the colorfulness
content of each pixel and the blobbing of the image
data, the individual images in the processed document
can be separated, as shown in figure 16, by means of an
AND function of such data and a mask, duly created on
the basis of the following relationships:

RGB AND WHITE = RGB

RGB AND BLACK = BLACK.
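The two relationships follow directly from a bitwise AND on 8-bit components, as the sketch below shows. Representing WHITE as (255, 255, 255) and BLACK as (0, 0, 0), and the function names, are assumptions made for illustration.

```python
WHITE = (255, 255, 255)  # all bits set: AND leaves the pixel unchanged
BLACK = (0, 0, 0)        # no bits set: AND always yields black


def and_pixel(pixel, mask_pixel):
    """Bitwise AND of two RGB pixels, component by component."""
    return tuple(c & m for c, m in zip(pixel, mask_pixel))


def extract_region(image, on_mask):
    """Keep the pixels where the mask is ON and blank the others to black."""
    return [
        [and_pixel(px, WHITE if on else BLACK) for px, on in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, on_mask)
    ]


if __name__ == "__main__":
    image = [[(12, 200, 77), (90, 90, 90)]]
    print(extract_region(image, [[1, 0]]))
```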

A data processor 17 for obtaining processed color and
BW layers is shown more precisely in figure 17. The
processed color and BW layers so obtained can be used,
for example, in a PDF representation of the input image
data. Particularly, the data processor is a scanning
device 17.

The scanning device 17 has an input IN that receives
the raster image data, for example in the RGB format,
and is connected to a dual path layer creator component
18, which in turn outputs a first and a second layer,
such layers having different data compression rates.

The first layer is inputted in a BW path 19 that
outputs a processed BW layer. In a similar manner, the
second layer is inputted in a color path 20 that
outputs a processed color layer. The processed BW and
color layers are finally inputted in a PDF device 21.

More particularly, in the example shown in figure 17,
the BW path 19 comprises a threshold block 22 connected
in series to a compressor 23. A G4 compression is often
used in the field of BW image data processing.

Moreover, also in figure 17, the color path 20
comprises an RGB to HLN converter 24, having the output
N channel connected to a series of a down scale device
25, a histogram and threshold selector 26, a look-up
table 27, a blob analysis block 28, a fill regions
block 29 and a compressor 30. A JPEG compression is
often used in the field of color image data processing.

There are several advantages of the document analysis
method using BW/color areas detection and of the
scanning device according to the present invention:

1. It enables different representations of raster data
to be present together for different uses. An example is
a 300dpi G4 compressed BW layer that lies under a 150dpi
JPEG compressed color layer. The color layer is more
pleasant to the eye but an OCR (Optical Character
Recognition) procedure could not be performed on the
text in the JPEG layer. With this approach, the OCR may
be applied to the BW data layer instead.

2. It enables a compromise between file size and use of
the document itself. Taking into account the example
above, a JPEG compressed 300dpi page (which can be used
for OCR) will result in about 600 Kbyte. A G4
compressed 300dpi BW page will result in about 80 Kbyte
but is lacking color information. If a 300dpi BW layer
is overlaid with a 150dpi JPEG layer, then the
resulting file size would be about 250 Kbyte + 80
Kbyte = 330 Kbyte. The resulting document would have
all the characteristics of the 300dpi JPEG compressed
version (OCR'able), but has about half the file size.

3. In the case of selectively adding color
information on the page, this approach has the
advantage that, when an error is made by the algorithm
and color was not added where it should have been, one
always has a representation of the original data
underneath (even if only BW data), and therefore no
document content is lost.
4. The scanning device according to the present
invention provides an efficient way to produce this
dual representation.

5. In a large series of cases, this approach achieves
good compression ratios without sacrificing the
original information that is present in the document
(for example, the extraction of text data for OCR). A
layer can be added for preview, for example, without
sacrificing the size of the document.

6. The amount of data to process is highly reduced.